OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

The bible, truth, and multilingual OCR evaluation

Internal identifier: 002027 (Main/Exploration); previous: 002026; next: 002028

Authors: T. Kanungo [United States]; P. Resnik [United States]

Source: SPIE proceedings series [ISSN 1017-2653]; 1999

RBID: Pascal:99-0297905

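French descriptors

Pascal: Traitement image document, Reconnaissance optique caractère, Reconnaissance forme, Recherche documentaire, Analyse documentaire, Multilinguisme
Wicri (topic): Recherche documentaire, Multilinguisme

English descriptors

KwdEn: Document analysis, Document image processing, Document retrieval, Multilingualism, Optical character recognition, Pattern recognition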

Abstract

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with ground truth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
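
The abstract does not say which accuracy measure the authors use. As an illustration only, the sketch below computes a common one, character accuracy, from the edit distance between a ground-truth string and an OCR output string. The function names `edit_distance` and `character_accuracy` are hypothetical and chosen for this example; they are not taken from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions turning `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """(n - errors) / n, where n is the length of the ground truth.
    Assumed definition for this sketch; it can go negative when the
    OCR output contains more errors than the ground truth has characters."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return (len(ground_truth) - errors) / len(ground_truth)


# One substituted character out of 16.
print(character_accuracy("In the beginning", "In the beginninq"))  # 0.9375
```

Because the measure is normalised by the ground-truth length, the same computation can be applied to parallel verses in different languages, which is the kind of comparison the abstract proposes.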


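Affiliations:

T. Kanungo: Center for Automation Research and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
P. Resnik: Institute for Advanced Computer Studies and Department of Linguistics, University of Maryland, College Park, MD 20742, USA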



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">The bible, truth, and multilingual OCR evaluation</title>
<author>
<name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Center for Automation Research, University of Maryland</s1>
<s2>College Park, MD 20742 </s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
<author>
<name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4">
<inist:fA14 i1="03">
<s1>Department of Linguistics, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">99-0297905</idno>
<date when="1999">1999</date>
<idno type="stanalyst">PASCAL 99-0297905 INIST</idno>
<idno type="RBID">Pascal:99-0297905</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000821</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000B73</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000755</idno>
<idno type="wicri:doubleKey">1017-2653:1999:Kanungo T:the:bible:truth</idno>
<idno type="wicri:Area/Main/Merge">002136</idno>
<idno type="wicri:Area/Main/Curation">002027</idno>
<idno type="wicri:Area/Main/Exploration">002027</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">The bible, truth, and multilingual OCR evaluation</title>
<author>
<name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Center for Automation Research, University of Maryland</s1>
<s2>College Park, MD 20742 </s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
<author>
<name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
<affiliation wicri:level="4">
<inist:fA14 i1="02">
<s1>Institute for Advanced Computer Studies, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">College Park (Maryland)</settlement>
<region type="state">Maryland</region>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
<affiliation wicri:level="4">
<inist:fA14 i1="03">
<s1>Department of Linguistics, University of Maryland</s1>
<s2>College Park, MD 20742</s2>
<s3>USA</s3>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<region type="state">Maryland</region>
<settlement type="city">College Park (Maryland)</settlement>
</placeName>
<orgName type="university">Université du Maryland</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
<imprint>
<date when="1999">1999</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">SPIE proceedings series</title>
<idno type="ISSN">1017-2653</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Document analysis</term>
<term>Document image processing</term>
<term>Document retrieval</term>
<term>Multilingualism</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Traitement image document</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance forme</term>
<term>Recherche documentaire</term>
<term>Analyse documentaire</term>
<term>Multilinguisme</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Recherche documentaire</term>
<term>Multilinguisme</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with ground truth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Maryland</li>
</region>
<settlement>
<li>College Park (Maryland)</li>
</settlement>
<orgName>
<li>Université du Maryland</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Maryland">
<name sortKey="Kanungo, T" sort="Kanungo, T" uniqKey="Kanungo T" first="T." last="Kanungo">T. Kanungo</name>
</region>
<name sortKey="Resnik, P" sort="Resnik, P" uniqKey="Resnik P" first="P." last="Resnik">P. Resnik</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002027 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002027 | SxmlIndent | more
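
Outside Dilib, the record can also be read with a general-purpose XML parser. As printed on this page, the record uses the `wicri:` and `inist:` prefixes without declaring them, so a strict parser would reject it. The following is a minimal sketch in Python that wraps the record in an element binding those prefixes to placeholder URIs; the URIs, the `wrapper` element, and the function name `parse_record` are assumptions made for this example, not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumed, not official): they only serve to
# bind the wicri: and inist: prefixes so that the parser accepts the record.
WRAPPER_OPEN = ('<wrapper xmlns:wicri="urn:example:wicri" '
                'xmlns:inist="urn:example:inist">')
WRAPPER_CLOSE = "</wrapper>"

# Shortened excerpt of the record shown on this page.
RECORD = """<record><TEI><teiHeader><fileDesc><titleStmt>
<title xml:lang="en" level="a">The bible, truth, and multilingual OCR evaluation</title>
<author>
<name sortKey="Kanungo, T" first="T." last="Kanungo">T. Kanungo</name>
<affiliation wicri:level="4"><inist:fA14 i1="01">
<s1>Center for Automation Research, University of Maryland</s1>
</inist:fA14></affiliation>
</author>
<author>
<name sortKey="Resnik, P" first="P." last="Resnik">P. Resnik</name>
</author>
</titleStmt></fileDesc></teiHeader></TEI></record>"""


def parse_record(xml_text: str) -> dict:
    """Return the title and author names found in the record's titleStmt."""
    root = ET.fromstring(WRAPPER_OPEN + xml_text + WRAPPER_CLOSE)
    title_stmt = root.find(".//titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        "authors": [name.text for name in title_stmt.findall("author/name")],
    }


info = parse_record(RECORD)
print(info["title"])
print(info["authors"])
```

In practice the output of the `HfdSelect` command above could be passed to `parse_record` in place of the `RECORD` excerpt, provided it has the same structure as the record shown here.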

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:99-0297905
   |texte=   The bible, truth, and multilingual OCR evaluation
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024